 cluster shape


Flexible Bivariate Beta Mixture Model: A Probabilistic Approach for Clustering Complex Data Structures

arXiv.org Artificial Intelligence

Clustering is an unsupervised learning method that is widely used in various applications, including image analysis, information retrieval, text analysis, bioinformatics, and many more [1, 2, 3, 4]. Clustering helps uncover the underlying structure of the data, facilitates data summarization, and sometimes serves as a preprocessing step for other algorithms [2]. Despite its widespread use, a primary challenge is that many traditional clustering algorithms assume that the data points form clusters with convex shapes. For example, centroid-based algorithms like k-means and distribution-based models like Gaussian Mixture Models (GMM) typically produce clusters that are hyperspherical or ellipsoidal [5]. Although this assumption simplifies the clustering process, it restricts the flexibility of these models to handle complex data distributions that do not conform to convex shapes.
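The convexity limitation described above can be illustrated with a minimal sketch. It is not taken from the paper: the helper names `ring` and `kmeans`, the ring radii, and the initial centroids are all assumptions chosen for illustration. Because k-means assigns each point to its nearest centroid, two clusters are always separated by a straight line, so two concentric rings cannot be recovered as separate clusters.

```python
import math

def ring(radius, n):
    """Return n evenly spaced 2-D points on a circle (hypothetical helper)."""
    return [(radius * math.cos((k + 0.5) * 2 * math.pi / n),
             radius * math.sin((k + 0.5) * 2 * math.pi / n)) for k in range(n)]

def kmeans(points, centroids, n_iter=20):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute means."""
    labels = []
    for _ in range(n_iter):
        labels = [min(range(len(centroids)),
                      key=lambda j: (p[0] - centroids[j][0]) ** 2
                                    + (p[1] - centroids[j][1]) ** 2)
                  for p in points]
        updated = []
        for j, c in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append((sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members)))
            else:
                updated.append(c)  # keep an empty cluster's centroid unchanged
        centroids = updated
    return labels, centroids

# Two concentric rings: a non-convex structure (radii are illustrative).
points = ring(1.0, 40) + ring(3.0, 40)
truth = [0] * 40 + [1] * 40          # 0 = inner ring, 1 = outer ring
labels, _ = kmeans(points, [(-1.0, 0.0), (1.0, 0.0)])

# For each k-means cluster, collect which true rings it contains.
mixed = [{t for t, lab in zip(truth, labels) if lab == j} for j in range(2)]
print(mixed)
```

With this initialization, each k-means cluster ends up holding points from both rings, which is the failure mode that motivates more flexible cluster shapes.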


Multivariate Beta Mixture Model: Probabilistic Clustering With Flexible Cluster Shapes

arXiv.org Artificial Intelligence

Data clustering groups data points into components so that similar points are within the same component. Data clustering is commonly used for data exploration and is sometimes used as a preprocessing step for later analysis [1]. In this paper, the multivariate beta mixture model (MBMM), a new probabilistic model for soft clustering, is proposed. As the MBMM is a mixture model, it shares many properties with the Gaussian mixture model (GMM), including its soft cluster assignment and parametric modeling. In addition, the MBMM allows the generation of new (synthetic) instances based on a generative process. Because the beta distribution is highly flexible (e.g., unimodal, bimodal, straight line, or exponentially increasing or decreasing), MBMM can fit data with versatile shapes.
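The shape flexibility mentioned above can be seen in the univariate case with a small sketch. This is only the standard one-dimensional beta density, not the multivariate beta distribution used by the MBMM; the function name `beta_pdf` and the parameter choices are assumptions made for illustration.

```python
import math

def beta_pdf(x, a, b):
    """Density of the univariate Beta(a, b) distribution for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

# Illustrative parameter choices, one per shape named in the abstract.
shapes = {
    "unimodal (a=2, b=5)": (2.0, 5.0),
    "bimodal / U-shaped (a=0.5, b=0.5)": (0.5, 0.5),
    "straight line (a=2, b=1)": (2.0, 1.0),
    "increasing (a=5, b=1)": (5.0, 1.0),
    "decreasing (a=1, b=5)": (1.0, 5.0),
}

grid = [0.1, 0.3, 0.5, 0.7, 0.9]
for name, (a, b) in shapes.items():
    values = ", ".join(f"{beta_pdf(x, a, b):.3f}" for x in grid)
    print(f"{name}: {values}")
```

Changing only the two shape parameters moves the density between these qualitatively different forms, which is the property a beta-based mixture component can exploit.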


ShaRP: Shape-Regularized Multidimensional Projections

arXiv.org Artificial Intelligence

Projections, or dimensionality reduction methods, are techniques of choice for the visual exploration of high-dimensional data. Many such techniques exist, each one of them having a distinct visual signature - i.e., a recognizable way to arrange points in the resulting scatterplot. Such signatures are implicit consequences of algorithm design, such as whether the method focuses on local vs global data pattern preservation; optimization techniques; and hyperparameter settings. We present a novel projection technique - ShaRP - that provides users explicit control over the visual signature of the created scatterplot, which can cater better to interactive visualization scenarios. ShaRP scales well with dimensionality and dataset size, generically handles any quantitative dataset, and provides this extended functionality of controlling projection shapes at a small, user-controllable cost in terms of quality metrics.


repliclust: Synthetic Data for Cluster Analysis

arXiv.org Artificial Intelligence

Our approach is based on data set archetypes, high-level geometric descriptions from which the user can create many different data sets, each possessing the desired geometric characteristics. The architecture of our software is modular and object-oriented, decomposing data generation into algorithms for placing cluster centers, sampling cluster shapes, selecting the number of data points for each cluster, and assigning probability distributions to clusters.


K-expectiles clustering

arXiv.org Machine Learning

$K$-means clustering is one of the most widely used partitioning algorithms in cluster analysis due to its simplicity and computational efficiency. However, $K$-means does not provide an appropriate clustering result when applied to data with non-spherically shaped clusters. We propose a novel partitioning clustering algorithm based on expectiles. The cluster centers are defined as multivariate expectiles, and clusters are searched via a greedy algorithm by minimizing the within-cluster '$\tau$-variance'. We suggest two schemes: fixed $\tau$ clustering and adaptive $\tau$ clustering. Validated by simulation results, this method beats both $K$-means and spectral clustering on data with asymmetrically shaped clusters, or clusters with a complicated structure, including asymmetric normal, beta, skewed $t$ and $F$ distributed clusters. Applications of adaptive $\tau$ clustering on crypto-currency (CC) market data are provided. One finds that the expectile clusters of CC markets show the phenomena of a market dominated by institutional investors. The second application is on image segmentation. Compared to other center-based clustering methods, the adaptive $\tau$ cluster centers of pixel data can better capture and describe the features of an image. The fixed $\tau$ clustering brings more flexibility to segmentation with decent accuracy.
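To make the notion of an expectile concrete, here is a minimal univariate sketch. It is not the authors' clustering algorithm: it only computes the $\tau$-expectile of a sample, assuming the standard definition as the minimizer of an asymmetrically weighted squared loss, and an asymmetric squared deviation around it as an assumed reading of '$\tau$-variance'. The function names `expectile` and `tau_variance` are hypothetical.

```python
def expectile(xs, tau, tol=1e-12, max_iter=1000):
    """tau-expectile of a sample, 0 < tau < 1.

    Points above the current estimate get weight tau, the others 1 - tau;
    the estimate is updated to the weighted mean until it stops moving.
    """
    e = sum(xs) / len(xs)  # start from the ordinary mean
    for _ in range(max_iter):
        w = [tau if x > e else 1.0 - tau for x in xs]
        new_e = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        if abs(new_e - e) < tol:
            return new_e
        e = new_e
    return e

def tau_variance(xs, tau):
    """Mean asymmetrically weighted squared deviation around the expectile
    (an assumed reading of the abstract's 'tau-variance')."""
    e = expectile(xs, tau)
    return sum((tau if x > e else 1.0 - tau) * (x - e) ** 2 for x in xs) / len(xs)

sample = [1.0, 2.0, 3.0, 10.0]  # right-skewed toy data
for tau in (0.1, 0.5, 0.9):
    print(tau, expectile(sample, tau), tau_variance(sample, tau))
```

For $\tau = 0.5$ the expectile reduces to the ordinary mean, so a center defined this way generalizes the $K$-means centroid; moving $\tau$ away from 0.5 shifts the center toward one tail of a skewed cluster.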


Beyond 4D Tracking: Using Cluster Shapes for Track Seeding

arXiv.org Machine Learning

Analyzing data from the Large Hadron Collider (LHC) presents a hyper challenge. A given collision event may result in hundreds of outgoing particles, each with many features (momentum, electric charge, etc.). This hypervariate phase space is then observed by complex multi-channel detectors that are essentially hyperspectral cameras. The LHC detectors have millions of readout channels, and dimensionality reduction is essential for data analysis. One natural and nearly lossless reduction is the reconstruction of charged particle trajectories ('tracks'). The innermost layers of the detectors at the LHC are constructed to register the passage of charged particles without significantly altering the particle energy or direction. In the ATLAS and CMS detectors, this is achieved using silicon sensors that are finely segmented in one or two directions and are called strips and pixels, respectively. We will focus on pixels, although our methodology applies more generally. Typically, the first step in a tracking algorithm is the construction of seeds, which are sets of three or more hit pixel clusters that can be used to fit charged-particle trajectories (see e.g.


Principal Ellipsoid Analysis (PEA): Efficient non-linear dimension reduction & clustering

#artificialintelligence

Even with the rise in popularity of over-parameterized models, simple dimensionality reduction and clustering methods, such as PCA and k-means, are still routinely used in an amazing variety of settings. A primary reason is the combination of simplicity, interpretability and computational efficiency. The focus of this article is on improving upon PCA and k-means, by allowing non-linear relations in the data and more flexible cluster shapes, without sacrificing the key advantages. The key contribution is a new framework for Principal Elliptical Analysis (PEA), defining a simple and computationally efficient alternative to PCA that fits the best elliptical approximation through the data. We provide theoretical guarantees on the proposed PEA algorithm using Vapnik-Chervonenkis (VC) theory to show strong consistency and uniform concentration bounds.